dplyr verbs
for transforming data frames
Example datasets:
nycflights13 flight data
Includes data on all departing flights from JFK, LGA, and EWR for all of
2013
nycflights13 weather data
Includes hourly weather data at JFK, LGA, and EWR for all of 2013
Start by running this code (chunk: load packages)
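The chunk itself is not reproduced in this text. A minimal version, assuming the dplyr and nycflights13 packages are already installed, might look like this:

```r
# Load the packages used throughout these slides
library(dplyr)         # filter(), arrange(), select(), mutate(), ...
library(nycflights13)  # provides the flights and weather data frames

# Quick look at the two example datasets
glimpse(flights)
glimpse(weather)
```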
filter() to keep or omit specific rows
filter() is equivalent to subsetting or indexing in base R
filter() in action
Let’s say we only want data for a specific airport from the flights data set
filter() goes through each row/observation and checks to
see if it meets the criteria you’ve specified
If yes, that row/observation remains in the resulting data frame; if no,
that row/observation is removed
What is this line of code doing? (chunk: filter 1)
In this example, filter is keeping all rows/observations that have “JFK” in the origin column and removing all others
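The chunk is not shown in this text; a sketch of a call matching that description (the result name jfk_flights is made up for illustration) could be:

```r
library(dplyr)
library(nycflights13)

# Keep only the rows where origin is exactly "JFK"; all other rows are dropped
jfk_flights <- flights %>%
  filter(origin == "JFK")

# Every remaining row has origin "JFK"
unique(jfk_flights$origin)
```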
filter() in action
We can do the same with numeric variables
What is this line of code doing? (chunk: filter 1)
Here, we want to keep only those flights that departed in the
morning
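One way to express "departed in the morning" is sketched below. In flights, dep_time is stored as a number in HHMM form (517 means 5:17 am), so this sketch assumes that "morning" means a departure time before 1200; the original chunk may have used a different cutoff.

```r
library(dplyr)
library(nycflights13)

# Keep flights whose actual departure time is before noon.
# Rows with a missing dep_time (cancelled flights) are dropped,
# because filter() only keeps rows where the condition is TRUE.
morning_flights <- flights %>%
  filter(dep_time < 1200)
```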
filter(), cont.
Options for filtering:
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
& and
| or
! not
is.na(variable) or !is.na(variable)
%in%
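A few of these options are sketched below on the flights data. The particular conditions and result names are arbitrary choices for illustration, not taken from the slides.

```r
library(dplyr)
library(nycflights13)

# & (and): JFK flights that left at least an hour late
late_jfk <- flights %>%
  filter(origin == "JFK" & dep_delay >= 60)

# | (or): flights in January or February
winter <- flights %>%
  filter(month == 1 | month == 2)

# %in%: the same rows as above, written more compactly
winter_in <- flights %>%
  filter(month %in% c(1, 2))

# !is.na(): drop rows with a missing departure time
departed <- flights %>%
  filter(!is.na(dep_time))
```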
filter()
How do we complete these lines of code to filter to all major airports in Texas (IAH, DFW, HOU)? (chunk: filter 2)
slice()
slice() is similar to filter(), but you select rows based on their position in the dataset rather than by some condition
What is this line of code doing? (chunk: slice 1)
This code pulls out rows 500 - 510 in the flights dataset
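A call that does this, reconstructed from the description rather than copied from the chunk, could be:

```r
library(dplyr)
library(nycflights13)

# Select rows by position: rows 500 through 510, inclusive (11 rows)
rows_500_510 <- flights %>%
  slice(500:510)

nrow(rows_500_510)
```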
slice()
You can use slice_head() and slice_tail() to replace base R’s head() and tail() functions and specify the number of rows you want to see
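As a small sketch of the pattern (n = 3 is an arbitrary choice here, different from the exercise):

```r
library(dplyr)
library(nycflights13)

# First three rows of the weather data
weather_first <- weather %>%
  slice_head(n = 3)

# Last three rows of the weather data
weather_last <- weather %>%
  slice_tail(n = 3)
```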
Complete these lines of code to view the first 6 rows and the last 12 rows of flight data (chunk: slice 2)
arrange()
arrange() allows us to sort/order our dataset by specific variables
What is this line of code doing? (chunk: arrange 1)
This code arranges flights by their departure time, from smallest to largest
arrange()
The default direction of sorting is ascending, but we can use desc() for descending
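The difference between the two directions can be sketched on another column. Sorting by distance is an arbitrary choice for illustration.

```r
library(dplyr)
library(nycflights13)

# Default: ascending, so the shortest flights come first
shortest_first <- flights %>%
  arrange(distance)

# desc(): descending, so the longest flights come first
longest_first <- flights %>%
  arrange(desc(distance))
```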
How can we alter this code to arrange flights by departure time, largest to smallest? (chunk: arrange 2)
arrange()
You can also select multiple variables in the order you want to sort them
Write a statement that arranges the flight data by arrival time in descending order, departure time in ascending order, and day in ascending order (chunk: arrange 3)
Use pipes + what we’ve just learned to answer the following
select()
select() is similar to filter(), but for columns instead of rows
You can select columns by name, position, or other characteristics
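The three approaches might look like the sketch below. The columns chosen are arbitrary, and ends_with() is one of several selection helpers that dplyr provides.

```r
library(dplyr)
library(nycflights13)

# By name
by_name <- flights %>%
  select(carrier, flight, tailnum)

# By position: the first three columns (year, month, day)
by_position <- flights %>%
  select(1:3)

# By a characteristic of the name: columns ending in "delay"
by_helper <- flights %>%
  select(ends_with("delay"))
```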
Complete this code to select the origin, destination, and distance columns (chunk: select 1)
select()
Complete this code to select the columns 13, 6, 2, and 8 (chunk: select 2)
select()
Complete this code to select all columns except tail number and time/hour (chunk: select 3)
select()
What is this code doing? (chunk: select 4)
This code selects columns that contain the string “dep”
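A call consistent with that description, assuming it is applied to the flights data, is:

```r
library(dplyr)
library(nycflights13)

# Keep every column whose name contains "dep"
dep_columns <- flights %>%
  select(contains("dep"))

names(dep_columns)
```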
mutate()
mutate() is a powerhouse function for data manipulation: it allows you to make new variables
We’ll cover mutate() in a basic sense now and delve into it
more in the next module
mutate()
You can use mutate() to create or calculate new variables from existing variables
What is this code doing? (chunk: mutate 1)
Here we’re creating a new variable called speed that’s
calculated by dividing values in the distance column by
values in the air_time column
The new column is added to the end of the dataset
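Reconstructed from the description above (the chunk itself is not shown), the call could look like this. Because distance is in miles and air_time is in minutes, speed here is in miles per minute.

```r
library(dplyr)
library(nycflights13)

# Add a speed column computed from two existing columns
flights_speed <- flights %>%
  mutate(speed = distance / air_time)

# The new column is the last one
tail(names(flights_speed), 1)
```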
We’ll be learning more about functions to use with
mutate() in the next module!
mutate()
Write a line of code that creates a new column called kilometers, which is calculated by multiplying the distance in miles by 1.609 (chunk: mutate 2)
relocate()
We can use relocate() to move one or more columns to a different position in our dataset, specifying which column they should go before or after
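The .before and .after arguments control the new position. The sketch below uses the flights data and arbitrarily chosen columns, so it does not answer the exercise.

```r
library(dplyr)
library(nycflights13)

# Move origin and dest in front of the year column
airports_first <- flights %>%
  relocate(origin, dest, .before = year)

# Move time_hour so that it sits directly after the day column
time_after_day <- flights %>%
  relocate(time_hour, .after = day)
```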
Complete this line of code to relocate any columns with the string “wind” to appear after the precipitation column (chunk: relocate 1)
rename() to rename columns in your dataset
rename() is exactly what it sounds like: it renames columns/variables in a dataset
Even if we use clean_names(), there may still be some
changes we want or need to make to certain column names
rename()
The argument structure is new_name = old_name
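To illustrate the order of the arguments, here is a sketch on a different weather column than the one in the exercise. The new name humidity is made up for this example.

```r
library(dplyr)
library(nycflights13)

# new_name = old_name: the humid column becomes humidity
weather_renamed <- weather %>%
  rename(humidity = humid)
```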
Complete this code to rename the dewp column to dew point (chunk: rename 1)
Use pipes + what we’ve learned so far to answer or accomplish the following
total_delay that adds together
the departure and arrival delays and put it in front of the distance
column, and
group_by()
group_by() lets us group our dataset by one or more variables or columns
This allows us to then calculate summary statistics for specific
groups
For example, if we have a dataset containing the height and age of
different trees,
we can group by the age variable and calculate the mean height for each
age group
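The tree example can be sketched with a tiny made-up dataset. The trees tibble and its values are hypothetical and exist only to show the pattern.

```r
library(dplyr)

# Hypothetical data: five trees with an age (years) and a height (metres)
trees <- tibble(
  age    = c(5, 5, 10, 10, 10),
  height = c(2.0, 3.0, 6.0, 7.0, 8.0)
)

# One row per age group, holding the mean height of that group
mean_height_by_age <- trees %>%
  group_by(age) %>%
  summarize(mean_height = mean(height))

mean_height_by_age
# age 5 has mean_height 2.5, age 10 has mean_height 7
```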
summarize() usually operates together with group_by(), unless you want to summarize your whole dataset down to a single value
summarize()
summarize() is similar to mutate(), but instead of adding a column to every row it collapses each group down to a single row
This allows us to do things like calculate a mean for each group or get
the number of observations for each group
group_by() and summarize() are very often used together
What is this code doing? (chunk: group summarize 1)
Here we are grouping by both origin and month, then using
summarize() to create a new column that calculates the mean
temperature for each airport each month
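A sketch matching that description, assuming the weather data, follows. The column name mean_temp, the na.rm = TRUE argument, and .groups = "drop" are assumptions added here and may differ from the original chunk.

```r
library(dplyr)
library(nycflights13)

# Mean temperature for each airport in each month
monthly_temp <- weather %>%
  group_by(origin, month) %>%
  summarize(
    mean_temp = mean(temp, na.rm = TRUE),  # ignore missing temperature readings
    .groups = "drop"                       # return an ungrouped result
  )

monthly_temp
```

The result has one row per combination of origin and month rather than one row per hourly observation.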
summarize()
Complete this code to create a table that counts the number of flights from each origin (chunk: group summarize 2)
Use pipes + what we’ve learned so far to answer or accomplish the following
[HINT: check out the function top_n(), which newer versions of dplyr supersede with slice_max() and slice_min()]
Use pipes + what we’ve learned so far to answer or accomplish the following
weather %>%
  group_by([ADD CODE HERE], month, day) %>%
  summarize(min_temp = [ADD CODE HERE]) %>%
  group_by([ADD CODE HERE]) %>%
  filter(min_temp == [ADD CODE HERE])

flights %>%
  filter(day == [ADD CODE HERE] & month == [ADD CODE HERE]) %>%
  group_by([ADD CODE HERE]) %>%
  summarize(count = [ADD CODE HERE])
Packages: dplyr
Data transformation with dplyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf